Thursday, March 31, 2011

The Advantages/Disadvantages of XML compared to RDBMS

Are there disadvantages to using XML instead of an RDBMS? I ask because my data is more naturally represented by an XML structure than by a relational one. I initially thought of storing the data in a relational database, but the lack of flexibility of relational databases in handling tree-like data structures was putting me off. So I am thinking about just storing the data in XML.

One thing I fear is a performance penalty. While an RDBMS can handle large datasets, I am not sure whether the same can be said of XML. Also, database queries are well established and fairly easy to construct and use, but what about XML queries? I don't know.

I am building a .NET application.

From stackoverflow
  • Two big inherent advantages of RDBMS are:

    1. Indexing. Greatly enhances performance.
    2. Constraining. You can define relationships between elements which helps maintain the integrity of your data.

    Keep in mind you can put XML in SQL Server and query it using XPath, so depending on the shape of your data, you may be able to get the best of both worlds.
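
    As a rough sketch of that idea, querying an xml column from C# might look like the following (the Documents table, Data column, connection string and XPath expression are all made up for illustration):

    // Requires System and System.Data.SqlClient; "connectionString" is assumed to point at your database.
    static void PrintCustomerNames(string connectionString)
    {
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(
            "SELECT Data.value('(/order/customer/@name)[1]', 'nvarchar(100)') FROM Documents",
            conn))
        {
            conn.Open();
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                    Console.WriteLine(reader.GetString(0));   // one customer name per row
            }
        }
    }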

  • In my opinion, these are the factors to consider

    1. Which fits your applications needs more closely
    2. How large a data set do you need to handle?
    3. Are you transferring data between applications or are you going to query it?


    Once these factors are considered, I would suggest using an RDBMS if you have large data-processing and querying needs, and XML if you need to export data or transfer it between applications. I would also suggest that you consider constraints on your data and integrity needs, as Nick has suggested.

    I have little experience in the area; however, this is what I have heard from others at my school.

    All the best.

  • You should not compare XML with an RDBMS, since they are two complementary technologies; XML should not be regarded as a replacement for an RDBMS.

    An RDBMS is for storing large amounts of data in a consistent way. The RDBMS should take care of the consistency of the data, etc.

    XML can be used for data exchange between different computer systems, for instance, but it should not be used to store large amounts of data over a long period of time.
    XML doesn't let you take care of data consistency the way an RDBMS does; it doesn't handle transactions, etc. XML is really nothing more than a text file that contains data in a structured way.

    annakata : +1 - it's more like comparing DBs with Files
  • You can have the best of both worlds: your data can be stored in the database, and that has to be a better solution, as a DB is faster, more secure, and has backup and restore, rollback, admin tools and so on.

    It sounds as though your data is hierarchical in nature; databases can be coerced into storing hierarchies without too many issues.

    When it comes to using your data, you can extract it as XML. I know that works out of the box if you're using SQL Server; I'm not so sure about Oracle.

  • Things an RDBMS provides that XML doesn't, more or less in order of importance:

    • enforcement of a defined schema (though this is certainly available to XML)
    • support for multiple writers
    • atomic transactions
    • referential integrity
    • well-defined query language
    • ability to optimize access through indexes, compiled queries, etc.
    • role-based security
    • triggers, stored procedures, calculated columns, etc.

    Plus you don't need to load the entire database into memory before you can access any of it.

    XML's an okay serialization format for an object model. It's good for hacking together relatively free-form data models that you can access with XPath, too - especially if you're going to transform that data into XML or HTML via XSLT. And it has the merit of being standard and platform-independent.
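
    For instance, a minimal XPath query in .NET looks something like this (the file name and element names here are invented for the sketch):

    using System;
    using System.Xml.XPath;

    class XPathExample
    {
        static void Main()
        {
            // Load a hypothetical catalog.xml and select the titles of cheap books
            var doc = new XPathDocument("catalog.xml");
            XPathNavigator nav = doc.CreateNavigator();

            XPathNodeIterator it = nav.Select("/catalog/book[price < 10]/title");
            while (it.MoveNext())
            {
                Console.WriteLine(it.Current.Value);
            }
        }
    }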

    But if you get too ambitious with it, you swiftly get into the kind of territory that results in you writing rants about how terrible XML is. (I'm talking to you, Jeff Atwood.)

Is monkey patching/class-reopening really an example of reflection?

Apologies for the recursive nature of this question, but the chosen answer to a question on SO got me questioning my understanding of reflection.

I thought reflection was mainly about querying the internal happenings of a program while it's running. The example given in this response patches Ruby's built-in Integer class.

  • Isn't this more like function overloading/inheritance rather than runtime modification?
  • Is class reopening really an example of reflection?
From stackoverflow
  • Reflection can be used to implement late binding.

    Late binding can be used to implement monkey patching.

    Monkey patching can be used to achieve the sort of coding style shown in that answer.

    But there are other ways to implement such features that don't require monkey patching, or reflection. Heck, a good macro pre-compiler could get you close.

    So, technically correct, but not (IMHO) the greatest example.

    notnot : IYO, what would be a more effective example of reflection?
  • At the risk of increasing the level of recursion, I would like to respond although you are referencing my answer at that link.

    The misunderstanding is an easy one to make because of our intuitive understanding of reflection as referring to looking inwards. And that's certainly an important aspect of reflection in programming also - in Ruby, for example, we have methods like instance_of to allow objects to ask questions about themselves at runtime.

    But take a look at the wikipedia definition of reflection:

    reflection is the process by which a computer program can observe and modify its own structure and behaviour.

    As you can see, reflection is more than just runtime self-inspection. It's also the ability to change runtime behavior. Reopening a class is also referred to as "monkey patching". You can read more about it here.

    A monkey patch is a way to extend or modify the runtime code of dynamic languages without altering the original source code.

    This process is also referred to as:

    • Guerrilla patching
    • Extending previously declared classes
    • Reopening classes
    • Dynamic Funk
    • Hijacking
    • Duck Punching
    • Method Swizzling

Firebird to SQL Server Migration

  1. Is there a tool native to SQL Server to do the migration (i.e., Import/Export)?

  2. Any other options (e.g., SSIS)?

Please point me to helpful info.

From stackoverflow

What are good LAMP choices for freelance web hosting?

I think it's best if I ask this question with an example scenario.

Let's say your mom-and-pop local hardware store has never had a website, and they want you, a freelance developer, to build them a website. You have all the skills to run a LAMP setup and admin a system, so the difficult question you ask yourself is – where will I host it? You aren't going to host it out of the machine in your apartment.

Let's say you want to be able to customize your own system, install the version of PHP you want, and manage your own database. Perhaps the best kind of hosting is to get a virtual machine so you can customize the system as you see fit. But this is essentially a "set it and forget it" site that you make, bill by the hour for, and then are done with. In other words, the hosting should not be an issue.

Given these hosting requirements:

  • Unlimited growth potential needing good amounts of bandwidth to handle visitors
  • Wide range of system and programming options allowing it to be portable
  • Relatively cheap (not necessarily the cheapest) or reasonable scaling cost
  • Reliable hosting with good support
  • Hosted entirely on the host company's hardware

Who would you pick to host this website? Yes, I am asking for a business/company recommendation. Is there a clear answer for this scenario, or a good source that can reliably give the current answer?

I know there are all kinds of schemes out there. I'm just wondering if any one company fills the bill for freelancers and stands out in such a crowded market.

From stackoverflow
  • Well, some good VPS solutions that allow for pain-free upgrades and are really cheap are Linode and Slicehost. The problem here, though, is that they aren't set-up-and-forget: if they need an upgrade, you have to do it manually. However, with those two hosts, you order the upgrade and it is performed painlessly in less than 5 minutes. All your files will be intact.

    Based on your description, though, it sounds like you want a cloud host where you can just set up the server and have it automatically scale to what you need. In that case, you'll want to check out Amazon EC2 and Amazon S3.

    David Zaslavsky : +1 more for Slicehost coming from a very satisfied customer ;-) Slicehost is really meant for people who want to get involved in the "dirty work" of maintaining a server, i.e. upgrades and such. If you want to set it and forget it, VPS isn't really the way to go.
    Brendan Long : The benchmarks I've seen show Linode being faster, and it comes with more memory, disk space and bandwidth. EC2 is more scalable (more scalable than 99% of people will ever need), but it's also more expensive and the latency is higher.
  • I've used RimuHosting; they have great service (they respond in minutes a lot of the time). They'll set you up with a virtual server however you want, and you get root access and can configure it how you'd like. If you need help with something, they've always helped me very quickly. You can pick whichever distro or software you'd like.

  • I've been extremely pleased with webfaction http://webfaction.com. They have stock installations of several popular applications and frameworks (PHP, Django, Drupal, etc.). However, you're not locked into these. While they don't give you root access, they do give you access to a complete toolchain allowing you to compile and install whatever version of whatever components you need.

    I've compiled and installed Erlang, ejabberd, couchdb, rabbitmq, activemq, openfire on my server with only minor hitches mostly due to ignorance on my part, not their system.

  • I've been using site5 http://www.site5.com/ for a number of years now and would definitely recommend them. They support PHP, Ruby on Rails and Python and allow SSH access, so you can get quite a bit done. Their support is awesome and they often let you install arbitrary software (they let me have Mercurial before it was standard on their setup).

Open files in Word via ribbon code-behind

Using VSTO, I've created a custom tab in the Ribbon designer and added some groups and button controls there. When the user clicks one of the buttons, I'd like to connect to a SharePoint site and open a Word document from it in Word (an instance is already open). I'm able to connect to the SharePoint site already and have the URLs to the documents I want to open.

But how can I actually load these documents into Word? I'm already in the code-behind in Word, so how can I target the Word instance I'm in and open a file there?

Thanks in advance.

From stackoverflow
  • You would have to use the Word API to open a document. See this link for a reference. You may have to update it based on the API version you use.

    private void button1_Click(object sender, System.EventArgs e)
    {
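        // WordApp below is assumed to be an existing Microsoft.Office.Interop.Word.Application instance.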
        // Use the open file dialog to choose a word document
        if (this.openFileDialog1.ShowDialog() == DialogResult.OK)
        {
            // set the file name from the open file dialog
            object fileName = openFileDialog1.FileName;
            object readOnly = false;
            object isVisible = true;
            // Here is the way to handle parameters you don't care about in .NET
            object missing = System.Reflection.Missing.Value;
            // Make word visible, so you can see what's happening
            WordApp.Visible = true;
            // Open the document that was chosen by the dialog
            Word.Document aDoc = WordApp.Documents.Open(ref fileName, ref missing, ref readOnly, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref isVisible);
            // Activate the document so it shows up in front
            aDoc.Activate();
            // Add the copyright text and a line break
            WordApp.Selection.TypeText("Copyright C# Corner");
            WordApp.Selection.TypeParagraph();
        }
    }
    
    Kon : Yeah, that's what I've got working now. So it kind of works, but I have an issue with this... it opens in a new Word window, not the instance I was initially using. Is there a way to 'fix' that?
    Kon : I found my answer here: http://social.msdn.microsoft.com/Forums/en-US/vsto/thread/b6fa2787-bf87-4ef2-9c99-9df9f2c0a202/. Had to use Globals.ThisAddin.Application.Documents.Open(...)
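
    For reference, here is a minimal sketch of that approach (the SharePoint URL is made up, and this assumes C# 4's optional-parameter support for COM interop; on older compilers you would pass ref missing for the unused parameters):

    // Open the document in the Word instance hosting the add-in, rather than a new one
    string url = "http://sharepoint/sites/docs/Shared Documents/Spec.docx";
    Word.Document doc = Globals.ThisAddIn.Application.Documents.Open(url, ReadOnly: false, Visible: true);
    doc.Activate();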

Remote HTTP POST with C#

How do you do a remote HTTP POST (request) in C#?

I really need this, please. :(

From stackoverflow
  • HttpWebRequest

  • You can use WCF or create a WebRequest

    var httpRequest = (HttpWebRequest)WebRequest.Create("http://localhost/service.svc");
    httpRequest.Method = "POST";
    
    using (var outputStream = httpRequest.GetRequestStream())
    {
        // some complicated logic to create the message
    }
    
    var response = httpRequest.GetResponse();
    using (var stream = response.GetResponseStream())
    {
        // some complicated logic to handle the response message.
    }
    
  • I use this very simple class:

    public class RemotePost
    {
        private System.Collections.Specialized.NameValueCollection Inputs
            = new System.Collections.Specialized.NameValueCollection();

        public string Url = "";
        public string Method = "post";
        public string FormName = "form1";

        public void Add(string name, string value)
        {
            Inputs.Add(name, value);
        }

        public void Post()
        {
            System.Web.HttpContext.Current.Response.Clear();
            System.Web.HttpContext.Current.Response.Write("<html><head>");
            System.Web.HttpContext.Current.Response.Write(string.Format("</head><body onload=\"document.{0}.submit()\">", FormName));
            System.Web.HttpContext.Current.Response.Write(string.Format("<form name=\"{0}\" method=\"{1}\" action=\"{2}\" >", FormName, Method, Url));

            for (int i = 0; i < Inputs.Keys.Count; i++)
            {
                System.Web.HttpContext.Current.Response.Write(string.Format("<input name=\"{0}\" type=\"hidden\" value=\"{1}\">", Inputs.Keys[i], Inputs[Inputs.Keys[i]]));
            }

            System.Web.HttpContext.Current.Response.Write("</form>");
            System.Web.HttpContext.Current.Response.Write("</body></html>");
            System.Web.HttpContext.Current.Response.End();
        }
    }
    

    And you use it thusly:

    RemotePost myremotepost = new RemotePost();
    myremotepost.Url = "http://www.jigar.net/demo/HttpRequestDemoServer.aspx";
    myremotepost.Add("field1", "Huckleberry");
    myremotepost.Add("field2", "Finn");
    myremotepost.Post();
    

    Very clean, easy to use and encapsulates all the muck. I prefer this to using the HttpWebRequest and so forth directly.

    BobbyShaftoe : Why is this getting downvoted?
    David : If I'm reading this correctly, it doesn't actually post a form, but responds with a form that can be posted.
    CodeMonkey1 : I downvoted because it only works in the context of a web page response and even in that case it kills whatever else you may have wanted to do in that page. Also it only allows for a fire & forget post, and is a convoluted way to do it.
  • Use the WebRequest.Create() and set the Method property.

  • HttpWebRequest HttpWReq = 
    (HttpWebRequest)WebRequest.Create("http://www.google.com");
    
    HttpWebResponse HttpWResp = (HttpWebResponse)HttpWReq.GetResponse();
    Console.WriteLine(HttpWResp.StatusCode);
    HttpWResp.Close();
    

    Should print "OK" (200) if the request was successful

    bendewey : Since the OP is doing a POST you should mention the request stream side as well.
  • Also System.Net.WebClient
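
    For example, a minimal sketch of a form POST with WebClient (the URL and field names are made up):

    // Requires System.Net, System.Collections.Specialized and System.Text
    var client = new WebClient();
    var values = new NameValueCollection
    {
        { "field1", "Huckleberry" },
        { "field2", "Finn" }
    };

    // UploadValues sends an application/x-www-form-urlencoded POST and returns the response body
    byte[] responseBytes = client.UploadValues("http://example.com/handler", "POST", values);
    string response = Encoding.UTF8.GetString(responseBytes);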

  • This is code from a small app I wrote once to post a form with values to a URL. It should be pretty robust.

    _formValues is a Dictionary<string,string> containing the variables to post and their values.

    
    // encode form data
    StringBuilder postString = new StringBuilder();
    bool first=true;
    foreach (KeyValuePair<string, string> pair in _formValues)
    {
        if(first)
         first=false;
        else
         postString.Append("&");
        postString.AppendFormat("{0}={1}", pair.Key, System.Web.HttpUtility.UrlEncode(pair.Value));
    }
    ASCIIEncoding ascii = new ASCIIEncoding();
    byte[] postBytes = ascii.GetBytes(postString.ToString());
    
    // set up request object
    HttpWebRequest request;
    try
    {
        request = WebRequest.Create(url) as HttpWebRequest;
    }
    catch (UriFormatException)
    {
        request = null;
    }
    if (request == null)
        throw new ApplicationException("Invalid URL: " + url);
    request.Method = "POST";
    request.ContentType = "application/x-www-form-urlencoded";
    request.ContentLength = postBytes.Length;
    
    // add post data to request
    Stream postStream = request.GetRequestStream();
    postStream.Write(postBytes, 0, postBytes.Length);
    postStream.Close();
    
    HttpWebResponse response = request.GetResponse() as HttpWebResponse;
    
    
    Liam : Thanks, the details on how to build the POST data really helped!
  • I'm using the following piece of code for calling web services using the HttpWebRequest class:

    internal static string CallWebServiceDetail(string url, string soapbody,
        int timeout) {
        return CallWebServiceDetail(url, soapbody, null, null, null, null,
            null, timeout);
    }
    internal static string CallWebServiceDetail(string url, string soapbody,
        string proxy, string contenttype, string method, string action,
        string accept, int timeoutMilisecs) {
        var req = (HttpWebRequest) WebRequest.Create(url);
        if (action != null) {
         req.Headers.Add("SOAPAction", action);
        }
        req.ContentType = contenttype ?? "text/xml;charset=\"utf-8\"";
        req.Accept = accept ?? "text/xml";
        req.Method = method ?? "POST";
        req.Timeout = timeoutMilisecs;
        if (proxy != null) {
         req.Proxy = new WebProxy(proxy, true);
        }
    
        using(var stm = req.GetRequestStream()) {
         XmlDocument doc = new XmlDocument();
         doc.LoadXml(soapbody);
         doc.Save(stm);
        }
        using(var resp = req.GetResponse()) {
         using(var responseStream = resp.GetResponseStream()) {
          using(var reader = new StreamReader(responseStream)) {
           return reader.ReadToEnd();
          }
         }
        }
    }
    

    This can be easily used to call a webservice

    public void TestWebCall() {
        const string url = 
    "http://www.ecubicle.net/whois_service.asmx/HelloWorld";
        const string soap = 
    @"<soap:Envelope xmlns:soap='about:envelope'>
        <soap:Body><HelloWorld /></soap:Body>
    </soap:Envelope>";
        string responseDoc = CallWebServiceDetail(url, soap, 1000);
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(responseDoc);
        string response = doc.DocumentElement.InnerText;
    }
    
  • The problem when beginning with a high-level language like C#, Java or PHP is that people may never have seen how simple the underlying protocol really is. So here’s a short introduction:

    http://reboltutorial.com/blog/raw-http-request/

Examples of large scale Open Source CMS deployments?

I am trying to evaluate open source options to replace my current CMS-based publication application. My current CMS has about 12000 HTML pages and about 100000 uploaded files. The size of the data is about 20 gigabytes. Drupal, Joomla and Plone seem interesting. However, I am concerned about whether these are ready to take on all this data. Do you know of any large-scale (comparably sized) CMS deployments? Any supporting numbers will help greatly.

Please note that my CMS application is a publishing system and not a collaborative/social-network type site.

From stackoverflow
  • Drupal, in particular, focuses on performance. It has several types of internal caches, and, combined with a PHP cache (such as APC, which I use on my sites), it is quite performant. As of Drupal 6.0 the menu system (which drives the whole page-request structure) was totally rewritten for optimization purposes.

    My largest Drupal community has about 800 users, about 1300 content pages, and a couple thousand uploaded files totaling around 3 GB, and experiences sub-200ms page loads. It's about 1/10 the size of your site, but since you don't need community features (which generally require a lot of custom database queries), you should experience comparable performance.

    Drupal's home site, drupal.org, has about 430000 users, and about 400000 pages, and gets similar page load times (although they're running a cluster of servers).

    So I'm pretty confident Drupal should be able to handle your site.

  • fastcompany.com launched with ~750,000 pieces of content on day 1. They had performance and scaling problems initially, but it was related specifically to the fact that large-scale faceted search of the entire content base turned out to be the most popular feature, and they weren't using a dedicated search indexing system.

    The New York Observer converted to Drupal a while ago, and their scaling problem had nothing to do with the amount of content; it was straightforward "how to handle Drudge and the Huffington Post both linking to you at the same time during the election season"

    The Onion, Lifetime Television, and a number of other pretty large sites use Drupal. Mother Jones magazine just converted to it. NowPublic.com, the crowdsourced news site, also runs on Drupal and has been since the (much slower) days of Drupal 4.7.

    The key scaling issue is not really how many discrete pieces of content you have, but rather the kind of slicing and dicing you'll be doing with your queries. Those are optimized ad-hoc, like any other SQL query. Drupal tends to focus on optimising for small to medium sites out of the box, and the larger stuff requires prodding around at the indexes and paying attention to how you build your Views-based pages (since they're basically just presentation logic wrapped around SQL).

    As an earlier poster noted, if you don't need lots of user-customized content ('stuff my friends have posted,' 'what my buddies are doing,' etc.) the amount of expensive querying drops dramatically.

  • I've got to put in a plug for Plone. I use it as a document repository which contains lots and lots of scanned images that are quite large. No problems so far, but it's not yet at the size that you are talking about.

    • Plone has an FTP based interface so that might ease your migration pains.
    • Plone is written on top of an application server technology known as Zope. Because of that, plone's default back end is the Zope Object Data Base or ZODB. You can substitute a RDBMS for ZODB.
    • You can reconfigure ZODB to be a database that is distributed across multiple servers. This is called ZEO.
    • There is also work in progress for a file based repository system for plone.

    There are lots of consulting companies who can give you the stats you are looking for. Here's the only case study that I could easily google.

System.Collections - why so many options?!

Most of my programming experience is in a language where there is one collection data structure -- an array. Now that I'm working primarily in .NET, I've come to appreciate the vast number of tools available to me, but I also find it difficult to determine which tool is best suited to each problem. I often find this to be the case with collections.

I'm sure I'll be able to spot the right tool for the job quicker with time/experience, but can anyone offer some guidance on which collection classes are good for which jobs? Any good rules of thumb to follow?

EDIT: I find that I use List(T) almost always, which is sort of what prompted this question. I know there are very specific reasons to use the other classes. Although List(T) works most times, I want to avoid jamming something into a generic list when another structure is better suited. I have to be able to spot these cases.

Thanks!

From stackoverflow
  • You didn't say what language you used before, but I feel pretty confident in saying that if you believe that array was the only thing available, then you were probably mistaken.

    C++, for example, only supports array "collections" natively ("collections" used very loosely here), but with the addition of pointers you can implement an equivalent of any collection data structure available in .NET. In fact, if you look in the C++ Standard Template Library you will find stock implementations of most of the common structures.

    The reason for the additional structures is that an array is not always, or even often, the most appropriate structure to use for a collection of data. It has a number of limitations that can be solved by one collection or another, and using those different collections you can often get much greater performance out of much less code, and reduce the chance there's a bug in your data structure implementation as well.

    When deciding what collection type to use, you need to look at how it will be used most often. For example, are all the objects in the collection expected to be of the same type, inherited from the same type, or any type? Are you going to be frequently adding and removing items? If so, will you always push/pop, queue/dequeue items or do you need to add items to specific locations? Will you look up specific items by key, by index, or both? If by key, how is the key determined?

    Some of the more common collections:

    • List<T> should probably be used in most of the situations where you're used to using an array. It supports lookup by index using the same syntax as an array with performance approaching that of an array, is strongly-typed, and makes it very easy to add or remove items and very fast to append or pop items (inserting to a specific position is much slower).

    • LinkedList<T> should sound familiar if you've done any formal computer science training. It uses syntax similar to List, but is optimized differently: lookups are slower because they require traversing the list, while adding or removing an item to a specific position can be much faster.

    • Dictionary<TKey, TValue> uses syntax similar to a List<T>, but instead of an array index you put a key value in the brackets. Dictionaries are great because lookups of specific items by key are considered to be very fast, in that no matter how many items are in the Dictionary it will always take about the same amount of time to find the one you need.

    • SortedList<TKey, TValue> works much like a Dictionary, with the exception that when you iterate over it, items are returned sorted by key. However, you can't look up the nth item without first iterating over all the items before it.

    • KeyedCollection is often overlooked because it's hidden in a different namespace from some of the other collections and you have to implement a (very easy) function to use it. It also works much like a dictionary, with the addition that it supports easy lookup by index. It is normally used when the key for an item is a simple property of the item itself.

    • Don't forget the old standbys: Stack and Queue. Again, if you have any formal computer science education at all you should already have a pretty good idea how those work based on their names.

    Finally, most of these collections (array included!) implement a set of common interfaces. These interfaces are very useful, in that you can write a program against an interface rather than a specific collection, and then your function can accept any collection that implements that interface. For example, the following code will work whether you pass in a string array, a List<string>, or any other IEnumerable<string>:

    void WriteToConsole(IEnumerable<string> items)
    {
        foreach (string item in items)
        {
           Console.WriteLine(item);
        }
    }
    

    Other interfaces worth looking at include IList<T>, ICollection<T>, and IQueryable<T>.

    Thomas : Some things that you may want to add to your otherwise excellent reply: adding elements to List is only fast if you add them at the end; and mention LinkedList, which has very fast insertions and deletions anywhere, but does not support indexing elements directly.
    Jon Tackabury : +1 concise answer.
  • Collections like Stack, Queue, SortedList, Dictionary and Hashtable are all standard data structures which come in handy in various situations.

    Queue gives you FIFO behaviour without you having to implement it yourself, Stack gives you LIFO, SortedList gives you a presorted list, and so on.

    There are many others in the collections namespace, and they are all discussed here.
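
    A quick sketch of the FIFO/LIFO difference:

    // Requires System and System.Collections.Generic
    var queue = new Queue<string>();
    queue.Enqueue("first");
    queue.Enqueue("second");
    Console.WriteLine(queue.Dequeue());   // "first"  - FIFO

    var stack = new Stack<string>();
    stack.Push("first");
    stack.Push("second");
    Console.WriteLine(stack.Pop());       // "second" - LIFO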

  • Generic lists (List<T>) are good for common use. They don't perform boxing and unboxing, so there are no performance problems.

    List<string> items = new List<string>();
    items.Add("abc");
    items.Add("dfg");
    

    ArrayList accepts any object as an item, so it is good for situations where you store multiple types. For example, if you need to store an int and a string in the same collection, ArrayList is good for this.

    ArrayList items = new ArrayList();
    items.Add("abc");
    items.Add(1);
    items.Add(DateTime.Now);
    

    SortedLists and Hashtables store key-value pairs. You can define a key for your items; this helps you find them quickly. A SortedList is essentially an automatically sorted Hashtable.

    Hashtable items1 = new Hashtable();
    items1.Add("item1", "abc");
    items1.Add("item2", "dfg");
    
    SortedList items2 = new SortedList();
    items2.Add("Second", "dfg");
    items2.Add("First", "abc");
    

    Hope this helps !

  • Two tips I can offer: (1) use generic collections as much as possible; (2) when deciding between a HashSet and a generic List, really look at what you are going to be using them for. HashSets may be faster at searching, but they also slow down on inserts (I have found).

  • Algorithms and Data Structures. Each one has its advantages and disadvantages, and each one has its purpose.

  • There are lots of posts related to this issue. You must think about WHAT you really need to do: do you need a string-based key? How is the data going to be populated? Do you need a built-in method to check whether a given key exists, or whether a given value exists?

    Generics are what I use most, but there is a reason for the others ;)

    http://discuss.fogcreek.com/dotnetquestions/default.asp?cmd=show&ixPost=5119

  • Like so many other things in computer science, when there are multiple choices, it usually means there are multiple ways of doing something. As others have said, there are various advantages and disadvantages of each collection. Regardless of whether you're using the generic versions of the collections or not, ultimately all collections provide these operations:

    • insert
    • update
    • delete
    • lookup
    • enumeration

    The different collections have different performance characteristics for each of these operations. For example, an array is quick to update an item, but takes longer to insert or delete an item. Lookup is very fast.

    Compare that with a List. A List is very fast to insert into, while lookup takes longer. Update and delete operations require that you already have the item and are pretty fast. Enumeration of both an array and a List is about the same.

    All collections also have certain behaviors; for example, does the collection maintain sorted order? If so, the insert/update/delete operations will take longer, but lookup will be faster.

    So what your program is doing most of the time will determine which collection to use.

C# string won't concatenate

// Reads NetworkStream into a byte buffer.

NetworkStream ns;
System.Net.Sockets.TcpClient client = new TcpClient();

byte[] receiveBytes = new byte[client.ReceiveBufferSize];
ns.Read(receiveBytes, 0, (int)client.ReceiveBufferSize);
String returndata = Encoding.UTF8.GetString(receiveBytes);

I am successfully reading from a client and storing the result into a string called returndata. However, when I try to concatenate returndata with anything, no concatenation occurs. Ex: String.Concat(returndata, "test") returns returndata, as does returndata + "test".

Does anyone know why this is happening?

Edit: Steve W is correct; I found out later that returndata.Length was always returning 8192.

From stackoverflow
  • Are you assigning it to a string or back to itself?

    returndata = string.Concat(returndata, "test");
    returndata += "test";
    
    ShuggyCoUk : Psychic debugging - gotta love it :)
  • Strings are immutable. This means that when you add (concatenate) another string to your string, you'll receive a new string instance.
    The original string itself won't be changed.

  • To expand on jhunter's answer, the Concat method doesn't alter the contents of the original string variable, it just returns the concatenated result. If you want returndata to contain the result of the concatenation, you need to use returndata = string.Concat(returndata, "test").

    abelenky : It's not just the Concat method; NO method directly modifies strings in C#. Every string is a unique, immutable object. If you change a string, it makes a new copy.
    Mike Powell : That's true--bad choice of words on my part.
  • I believe the problem is related to not keeping track of the total number of bytes read. Your byte buffer, set to ReceiveBufferSize, is more than likely larger than the actual number of bytes read. By taking into account the actual bytes read, and then passing it to the Encoding.UTF8.GetString() method, you should get a valid string that can be concatenated. Here's an example:

    
    NetworkStream ns;
    System.Net.Sockets.TcpClient client = new TcpClient();
    
    byte[] receiveBytes = new byte[client.ReceiveBufferSize];
    int bytesRead = ns.Read(receiveBytes, 0, receiveBytes.Length);
    String returndata = Encoding.UTF8.GetString(receiveBytes,0,bytesRead);
    returndata = returndata.Trim(new char[] {'\0'});
    
    

    Note also the other suggestion: reading a null terminator and including it in the bytes read could also be an issue. I've included a Trim of the string to make sure null terminators are removed.

ROR Model Setup Question: Table Relationships

I'm developing a site that will have a model for users, a model for submissions, and a model for ratings. Each submission may have only one rating per user. My question is how should I set this up in my models?

From stackoverflow
  • class Submission < ActiveRecord::Base
      has_many :ratings
    end
    class Rating < ActiveRecord::Base
      belongs_to :user
      belongs_to :submission
      validates_uniqueness_of :user_id, :scope => :submission_id
    end
    

Is there a better way of writing a git pre-commit hook to check any PHP file in a commit for parse errors?

What I have so far is

#!/bin/sh

php_syntax_check()
{
    retval=0
    for i in $(git-diff-index --name-only --cached HEAD -- | grep -e '\.php$'); do
     if [ -f $i ]; then
      output=$(php -l $i)
      retval=$?
      if [ $retval -gt 0 ]; then
       echo "=============================================================================="
       echo "Unstaging $i for the commit due to the follow parse errors"
       echo "$output"
       git reset -q HEAD $i
      fi
     fi
    done

    if [ $retval -gt 0 ]; then
     exit $retval
    fi
}
php_syntax_check
From stackoverflow
  • Does this work? Is it a case of your code not doing what you need, or does it have some limitations?

    Andrew

  • @Andrew Taylor: This works OK, but since it is the first git hook that I've written, I was just wondering if there was a simpler or cleaner way to do it.

  • I'm sorry if it's offtopic, but aren't you supposed to run some kind of automated tests (which would imply that the code has no syntax errors) before doing a commit?

  • If you've got php5-cli installed, you can write your pre-commit hook in PHP and use the syntax you're more familiar with.

    Just do something more like:

    #!/usr/bin/php
    <?php /* Your pre-commit check. */ ?>
    
  • If the commit is a partial commit (not all the changes in the working tree are committed), then this may give incorrect results, since it tests the working copy and not the staged copy.

    One way to do this could be:

    git diff --cached --name-only --diff-filter=ACMR | xargs git checkout-index --prefix=$TMPDIR/ --
    find $TMPDIR -name '*.php' -print | xargs -n 1 php -l
    

    Which would make a copy of the staged images into a scratch space and then run the test command on them there. If any of the files include other files in the build then you may have to recreate the whole staged image in the test tree and then test the changed files there (See: Git pre-commit hook : changed/added files).

Regex to find special characters in a String with some exceptions

I just got a similar (but not identical) question answered. I now need help with the question below.

I want to write a regex to match a character if it's a non-word, non-digit and non-star (*) character. So, the characters [0-9][a-z][A-Z] and * should not match, and any others should.

I tried writing [\W[^*]] but it doesn't seem to work.

I hope I made it clear, but if not, I apologize. Thanks much for the help.

From stackoverflow
  • Try this instead:

    [^\w\*]
    
  • The simplest regular expression that matches a single character, that is not one of those you described, independent of any particular regular-expression extensions, would be:

    [^0-9a-zA-Z *]
    
  • [^\w\*]
    

    Simple enough.
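
    In C#, for example, that pattern could be checked like this (the test string is arbitrary):

    // Requires System and System.Text.RegularExpressions.
    // True if the input contains any character that is not a word character
    // (letter, digit or underscore) and not an asterisk.
    bool hasSpecial = Regex.IsMatch("abc*1$", @"[^\w\*]");
    Console.WriteLine(hasSpecial);   // True, because of the '$'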

How can I prevent my web content from being repackaged by another site

I just noticed that one of my questions, http://stackoverflow.com/questions/449207/how-can-i-call-activatekeyboardlayout-from-64bit-windows-vista came up in a google search at another site. http://devmeat.com/show/898409

It got me thinking, why would devmeat repackage SO content? Web traffic, money, maybe even an altruistic desire to bring value to their readers.

So, is there anything a web programmer can do to prevent this type of wholesale "repackaging" of content?

Note: I'm looking for a technological solution, not a legal one

EDIT: Here is the real world problem I'm trying to solve. I have spent the last 5 years creating a Punjabi x English Dictionary. I am interested in making it available through a web interface, but am concerned (maybe needlessly so) that someone will write a bot script to send over 30,000 English words and capture the translations.

Stephen writes below: "The whole informatics revolution is about being able to copy for (nearly) free"

So, now I am faced with the personal question about IP vs "gift to the world". BTW, I've made no decisions, just wrestling with the question.

David's comment "You can at least cut down on commercial use of what you put out..." strikes a chord with me. I give away a Windows-based version of the program for free, and after reading these comments, I've identified that my concern is that I don't want others to package my work and resell it. So maybe the solution is a legal one after all :)

From stackoverflow
  • Nothing that is worth the effort. At best you could convert all your text to images and hope no one OCRs it.

    Web content is downloaded to the client side. It's as simple as that. Anything visible on your site is public.

    Your other option is to hire a lawyer to sue for copyright infringement. (If you can find the dastards to sue)

    For a technical solution, you just want to make it harder for bots to steal your textual content? You have many choices, none of which are bulletproof.

    • Require users to login to see it.

    • Convert everything to a flash movie, that doesn't include selectable text.

    • Convert your text to GIF or PNG
      images (and increase the data size by at least 10x)

    There are others, but most people would advise you not to go that route, unless you can give a more specific situation and set of requirements.

    annakata : converting to images would be worse than letting the IP go imho...
    aleemb : increasing the data size by 10x is a very lousy idea. adversaries will take 20 hours instead of 2... so what? it will use up more of your bandwidth for genuine visitors and worsen their UX. login doesn't help either. a logged in user can also be an adversary.
    Piskvor : @aleemb: increasing size is a unavoidable side-effect of converting text to images, not a feature IMHO.
  • Technically, there's no way to prevent copying. At most it can be made a bit difficult, but it's certainly not worth it. Legally, you could just prohibit it. Then again, prohibiting doesn't prevent anything...

  • You can't really do anything about it. It's either in the public domain (publicly available) or it's not.

    EDIT: I'm not talking about whether it is legal or not; the poster is asking a technical question. I am suggesting that once you post it on the Internet, it effectively enters the "public domain" in the sense that anyone can do whatever they want with it. Whether this is legal or not is irrelevant if the person doing it is quite happy to engage in illegal activities.

    KeithB : "public domain" is a legal term with a specific meaning, that the owner specifically waives all rights to the content and anyone can legally do with it as they like. Everything you write on SO is copyright by you. You mean publicly available.
    User : Nice remark. As far as I know, even if you don't state anything on your page, your content is not automatically put in the public domain.
    KeithB : As presumably technical people, I think that it is important that we use the correct terminology as much as possible. There is enough confusion around copyright and patents as it is.
    Rob : Thank goodness, I just scrolled down having written more or less the same comment about the misuse of the term "public domain" on another answer.
  • Well, the content on this site is released under a CC license, so you can't prevent certain 'repackaging', as you call it. Generally you can't do much about it, except mailing them to ask if they can remove it.

    Other than that Google and other search engines are constantly improving their duplicate content filters. Just don't bother too much with it, not worth your time.

  • Not unless you can pay to hire a team of lawyers.

    User : A team of lawyers can do nothing about guys sitting in some other country, especially if their local legislation does not punish copying.
  • You would base your site on the principle that anything on it is copyrighted to you. Therefore, if someone steals from you, you can take legal action against them.

    Of course, this can never work on a site such as Stack Overflow, as the entire concept is based around sharing your intellect with the community (the world).

  • If you obfuscate your content or deliver a bad experience in an attempt to protect it, you run the risk of fading into obscurity like devmeat.com will.

  • I imagine devmeat are doing some sort of Feed aggregation.

    SO publishes Atom feeds for each question - see http://stackoverflow.com/feeds/question/449207

    Plus a recent questions feed - http://stackoverflow.com/feeds

    So in effect SO is saying - "Here - please have this content."

    With your own content on your own site, once it's on the Internet there is not really anything technically you can do to stop people - as others have said it's just a legal issue.

  • Just because something is publicly available does not mean that there is no copyright (in the UK). The author still owns the copyright to the work, but as mentioned above, there is nothing you can do to PREVENT it from happening from a technical perspective.

    That's the nature of HTML and the whole way the World Wide Web works; blame TBL (actually, blame the nefarious individuals who cannot think of their own original content).

    EDIT: Removed reference to the term 'Public Domain' as I did not mean it in its legally defined usage.

    DrG : Oh yes, I agree; I'm just saying that if it is in the "public domain" you can't actually stop people from taking it.
    Rob : Ugh, confusion of terminologies. "Public domain" is a term which, when applied to intellectual property, does in fact mean that there is no copyright, either because it has expired, or because it was never eligible for protection in the first place. Don't confuse "in the public domain" with "published on the World Wide Web".
  • If the process is an automatic one then you can take some steps, such as: only include brief snippets of your content in all your feeds. You can also output a chunk of your body text using javascript which means that most automated solutions will miss it out. Unfortunately, that also means search engines won't index that content either. You can't have it both ways! :)

  • Technically: You can't do anything about it.

    (You can make it harder, but this usually makes it harder for your users to use the site too, so don't.)

  • The whole informatics revolution is about being able to copy for (nearly) free. Artificially restricting that (by law) only reduces your ability to compete. Business models based on copying and distribution being expensive (publishing, music) will be replaced.

    [edit] The same technology that gives you an audience of millions also allows everyone to copy your content. Making money out of it can be done by providing added value:

    • letting people know you're the expert who created it and can do consultancy/ paid extensions;
    • being faster/more up to date. You might be able to improve the content faster than the bots pick it up;
    • simply asking for money from the part of the market that wants support. A market consists only of the people who are willing to pay for it.
    Noah : Nice point. This helped me clarify what and why I was looking for this solution; See my EDITS above
  • If you have to remain text-based, about all you can do is monitor and filter. If you know, by your logs, you're getting a disproportionate measure of traffic from a particular source, you can deny requests from that source, using one or more properties of the request. It's a crazily moving target, and totally unguaranteed, but it's an option.

    Also, if you're not publishing feeds intended for others to read legitimately, you can vary the structure of your documents (assuming you're generating them dynamically) in slight ways that'd disrupt screen-scraping efforts. Again, totally not guaranteed, and likely to have adverse effects, but something.

    If securing your content is important enough to you, though (as it was for one of my clients), Flash is definitely an option. Provided you can get the content to the SWF securely, and you code your Flash app to support deep linking, your visitors will be able to read it, and (with the Flash search player currently under active development) your content will be search-engine findable as well.

    Even so -- no guarantees. Copy-paste, OCR, etc. -- there'll always be workarounds. The only question is how far the hacker is willing to go to pull them off. All you can really do is deter.

    Christian Nunciato : Why the downvote? Something wrong with my post?
  • In the short term, you can convert your site to AJAX - none of the content is contained in the page originally downloaded, instead javascript is executed which locates the necessary content and displays it.

    This would require the bot authors to specifically attack and customize their bot for your site, either by analyzing the JavaScript (which is less effective for them, since you can merely change the script a bit, and the URLs it pulls from, to force them to start again) or by implementing a JavaScript engine and then ripping the 'rendered' page (maybe using Greasemonkey).

    Either option is painful, and unless you have very desirable content it's not worth it, so it's as effective as it needs to be.

    If your content is very valuable, though, then the only thing you can do is make it hard for quick hackers to get at (such as the above) and then employ bots to search for infringements and automatically send DMCA takedown notices. This is relatively hands-free work, so it's not as onerous as you might think, and it is reasonably effective.

  • You could limit the amount of content that one user (IP address, presumably) is allowed to receive through your web interface, similar to how Google Books restricts the number of pages you can view for some books.

    Not that it's hacker-proof, but it could be one approach.
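
    A minimal sketch of such a limit (the numbers are arbitrary, and in a real site the counters would live in a cache or database rather than a static field):

    using System;
    using System.Collections.Generic;

    // Hypothetical helper: allow at most 500 requests per IP address per day
    public static class RequestThrottle
    {
        private const int DailyLimit = 500;
        private static DateTime _day = DateTime.UtcNow.Date;
        private static readonly Dictionary<string, int> _counts = new Dictionary<string, int>();

        public static bool IsAllowed(string clientIp)
        {
            lock (_counts)
            {
                if (DateTime.UtcNow.Date != _day)       // a new day: reset all counters
                {
                    _day = DateTime.UtcNow.Date;
                    _counts.Clear();
                }

                int count;
                _counts.TryGetValue(clientIp, out count);
                if (count >= DailyLimit)
                    return false;                       // over the limit: deny or slow down

                _counts[clientIp] = count + 1;
                return true;
            }
        }
    }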

  • The general rule is that, if I can see it on a computer, I can copy it. People have been trying to change this for years, not very successfully. The more successful attempts involve carefully written software that needs to be run in order to see the content (this is usually done through cryptography) and that resists use for any other purpose.

    As it happens, people don't generally surf the web with special software; they surf it with programs like Firefox and Internet Explorer and Safari and Opera. There's no way you can serve information up to a web browser and still keep any control over it.

    Your only recourse is legal. Put a copyright notice on the page, and decide who you're prepared to take legal action against. You can at least cut down on commercial use of what you put out in that way, although it won't be worthwhile to go after noncommercial use. There is no technological solution.

    One thing to consider if you do decide to put your dictionary on-line: provide a full computer-readable download with whatever copyright license you like (there's a large variety of creative commons licenses, and one may be suitable for you). That way, fewer people will hit the site tens of thousands of times to copy your dictionary word by word.

  • Devmeat is NOT ripping any content as you are suggesting (or have I misunderstood?). Devmeat.com is an RSS/feed aggregator specialized for software developers. For me, basically, it's a place I can 'go and check what's new' between work tasks, and I get filtered/categorized news. It's using RSS, which is publicly available on stackoverflow (and many other sites), not some kind of site crawling or sh*t like that. It's completely legal. If the people from Stack Overflow did not wish to publish content via RSS, they would not do it. But they did, as many other sites do, because it's good for them - traffic.

    And to answer your question, you are asking in the wrong context. Given the context you describe, I should just say 'don't publish your data as RSS', because RSS was MADE for that - that's what devmeat is doing - RSS aggregation.

    Otherwise, if you want to prevent people from ripping your data (with bots/crawling), I think there's nothing you can do to be completely safe, simply because you publish data in a public place. I think ripping content IS bad, but you gave the wrong example.

  • I would suggest that if you don't want the content to be repackaged, then you will have to write your own client and transmit data to it in an encrypted fashion. E.g., I think a Java applet could do what you want, with a bit of rewriting of the text rendering to disallow copying & pasting.

    If you want to focus on providing a really great client and let the content slide, you run a big risk of some other 1-man shop developer doing a better job.

    Personally, I'd suggest locking the whole set of information and client down and getting a solid license.

  • I would disagree on principle, but I believe that Flash is the solution.

  • You should disable right-click, as some sites do. Then you should also put up strongly worded legal stuff: "Violating copyright laws is a serious thing."

    Kristen : BOTs use right click eh?

Database data needed in integration tests; created by API calls or using imported data?

This question is more or less programming-language agnostic. However, as I'm mostly into Java these days, that's where I'll draw my examples from. I'm also thinking about the OOP case, so if you want to test a method you need an instance of that method's class.

A core rule for unit tests is that they should be autonomous, and that can be achieved by isolating a class from its dependencies. There are several ways to do it and it depends on if you inject your dependencies using IoC (in the Java world we have Spring, EJB3 and other frameworks/platforms which provide injection capabilities) and/or if you mock objects (for Java you have JMock and EasyMock) to separate a class being tested from its dependencies.

If we need to test groups of methods in different classes* and see that they are well integrated, we write integration tests. And here is my question!

  • At least in web applications, state is often persisted to a database. We could use the same tools as for unit tests to achieve independence from the database. But in my humble opinion I think that there are cases when not using a database for integration tests is mocking too much (but feel free to disagree; not using a database at all, ever, is also a valid answer as it makes the question irrelevant).
  • When you use a database for integration tests, how do you fill that database with data? I can see two approaches:
    • Store the database contents for the integration test and load it before starting the test. If it's stored as an SQL dump, a database file, XML or something else would be interesting to know.
    • Create the necessary database structures by API calls. These calls are probably split up into several methods in your test code and each of these methods may fail. It could be seen as your integration test having dependencies on other tests.

How are you making certain that database data needed for tests is there when you need it? And why did you choose the method you choose?

Please provide an answer with a motivation, as it's in the motivation the interesting part lies. Remember that just saying "It's best practice!" isn't a real motivation, it's just re-iterating something you've read or heard from someone. For that case please explain why it's best practice.

*I'm including one method calling other methods in (the same or other) instances of the same class in my definition of unit test, even though it might technically not be correct. Feel free to correct me, but let's keep it as a side issue.

From stackoverflow
  • I generally use SQL scripts to fill the data in the scenario you discuss.

    It's straight-forward and very easily repeatable.

    DeletedAccount : But when the entities change you have to change the data in your SQL scripts as well. How come that has not been a problem for you?
  • This will probably not answer all your questions, if any, but I made the decision in one project to do unit testing against the DB. I felt in my case that the DB structure needed testing too, i.e. did my DB design deliver what is needed for the application. Later in the project when I feel the DB structure is stable, I will probably move away from this.

    To generate data I decided to create an external application that filled the DB with "random" data, I created a person-name and company-name generators etc.

    The reasons for doing this in an external program were: (1) I could rerun the tests on data already modified by earlier test runs, i.e. making sure my tests were able to run several times and that the data modifications made by the tests were valid; (2) I could, if needed, clean the DB and get a fresh start.

    I agree that there are points of failure in this approach, but in my case, since e.g. person generation was part of the business logic, generating data for tests was actually testing that part too.

  • I do both, depending on what I need to test:

    • I import static test data from SQL scripts or DB dumps. This data is used in object load (deserialization or object mapping) and in SQL query tests (when I want to know whether the code will return the correct result).

      Plus, I usually have some backbone data (config, value to name lookup tables, etc). These are also loaded in this step. Note that this loading is a single test (along with creating the DB from scratch).

    • When I have code which modifies the DB (object -> DB), I usually run it against a live DB (in memory or a test instance somewhere). This is to ensure that the code works, not to create any large number of rows. After the test, I roll back the transaction (following the rule that tests must not modify global state); a sketch of this follows at the end of this answer.

    Of course, there are exceptions to the rule:

    • I also create large amounts of rows in performance tests.
    • Sometimes, I have to commit the result of a unit test (otherwise, the test would grow too big).
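
    A sketch of the rollback-after-each-test idea with JUnit 4 and plain JDBC; OrderWriter and Order are hypothetical stand-ins for the code under test, and the connection URL is an assumption:

      import java.sql.Connection;
      import java.sql.DriverManager;

      import org.junit.After;
      import org.junit.Before;
      import org.junit.Test;

      public class OrderWriterIT {

          private Connection connection;

          @Before
          public void openTransaction() throws Exception {
              // Hypothetical test instance; the point is only the transaction handling.
              connection = DriverManager.getConnection("jdbc:hsqldb:mem:testdb", "sa", "");
              connection.setAutoCommit(false);   // everything the test writes stays uncommitted
          }

          @Test
          public void writesAnOrderRow() throws Exception {
              // OrderWriter and Order are hypothetical classes under test, which write
              // through the same connection and are asserted against it.
              new OrderWriter(connection).save(new Order("ABC-1"));
          }

          @After
          public void rollBack() throws Exception {
              connection.rollback();             // leave the shared database untouched
              connection.close();
          }
      }
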
  • I've used DBUnit to take snapshots of records in a database and store them in XML format. Then my unit tests (we called them integration tests when they required a database) can wipe and restore from the XML file at the start of each test (the restore step is sketched below).

    I'm undecided whether this is worth the effort. One problem is dependencies on other tables. We left static reference tables alone, and built some tools to detect and extract all child tables along with the requested records. I read someone's recommendation to disable all foreign keys in your integration test database. That would make it way easier to prepare the data, but you're no longer checking for any referential integrity problems in your tests.

    Another problem is database schema changes. We wrote some tools that would add default values for columns that had been added since the snapshots were taken.

    Obviously these tests were way slower than pure unit tests.

    When you're trying to test some legacy code where it's very difficult to write unit tests for individual classes, this approach may be worth the effort.
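
    A rough sketch of the restore step using DBUnit's flat-XML data sets; the connection details and snapshot path are assumptions, and the tooling described above may differ in detail:

      import java.io.File;
      import java.sql.Connection;
      import java.sql.DriverManager;

      import org.dbunit.database.DatabaseConnection;
      import org.dbunit.database.IDatabaseConnection;
      import org.dbunit.dataset.IDataSet;
      import org.dbunit.dataset.xml.FlatXmlDataSetBuilder;
      import org.dbunit.operation.DatabaseOperation;
      import org.junit.Before;

      public class LegacyReportIT {

          @Before
          public void restoreSnapshot() throws Exception {
              // Hypothetical connection details and snapshot path.
              Connection jdbc = DriverManager.getConnection("jdbc:hsqldb:mem:testdb", "sa", "");
              IDatabaseConnection connection = new DatabaseConnection(jdbc);

              // Snapshot previously exported with DBUnit in flat-XML format.
              IDataSet snapshot = new FlatXmlDataSetBuilder()
                      .build(new File("src/test/resources/orders-snapshot.xml"));

              // CLEAN_INSERT wipes the tables present in the data set and reloads them.
              DatabaseOperation.CLEAN_INSERT.execute(connection, snapshot);
          }
      }
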

  • I prefer creating the test data using API calls.

    In the beginning of the test, you create an empty database (in-memory or the same type that is used in production), run the install script to initialize it, and then create whatever test data the test uses. Creation of the test data may be organized, for example, with the Object Mother pattern (sketched at the end of this answer), so that the same data can be reused in many tests, possibly with minor variations.

    You want to have the database in a known state before every test, in order to have reproducible tests without side effects. So when a test ends, you should drop the test database or roll back the transaction, so that the next test can recreate the test data the same way every time, regardless of whether the previous tests passed or failed.

    The reason why I would avoid importing database dumps (or similar) is that it will couple the test data with the database schema. When the database schema changes, you would also need to change or recreate the test data, which may require manual work.

    If the test data is specified in code, you will have the power of your IDE's refactoring tools at your hand. When you make a change which affects the database schema, it will probably also affect the API calls, so you will need to refactor the code using the API anyway. With nearly the same effort you can also refactor the creation of the test data - especially if the refactoring can be automated (renames, introducing parameters, etc.). But if the tests rely on a database dump, you would need to manually refactor the database dump in addition to refactoring the code which uses the API.

    Another thing related to integration testing the database, is testing that upgrading from a previous database schema works right. For that you might want to read the book Refactoring Databases: Evolutionary Database Design or this article: http://martinfowler.com/articles/evodb.html
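
    A minimal sketch of the Object Mother idea mentioned above; the Customer and Order types and their methods are hypothetical examples of an application API, not anything taken from the question:

      // Named factory methods build commonly needed fixtures through the application's
      // own API, so IDE refactorings keep them in sync with the domain model.
      public class TestFixtures {

          public static Customer activeCustomer() {
              Customer customer = new Customer("Jane Doe", "jane@example.com");
              customer.activate();
              return customer;
          }

          public static Order paidOrderFor(Customer customer) {
              Order order = new Order(customer);
              order.addLine("Widget", 3);
              order.markPaid();
              return order;
          }
      }

      // Used from a test, after the install script has created the empty schema:
      //   Customer customer = TestFixtures.activeCustomer();
      //   customerRepository.save(customer);
      //   orderRepository.save(TestFixtures.paidOrderFor(customer));
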

  • It sounds like your question is actually two questions. Should you exclude the database from your testing? And when you do use a database, how should you generate the data in it?

    When possible I prefer to use an actual database. Frequently the queries (written in SQL, HQL, etc.) in CRUD classes can return surprising results when confronted with an actual database. It's better to flush these issues out early on. Often developers will write very thin unit tests for CRUD, testing only the most benign cases. Using an actual database for your testing can exercise all kinds of corner cases you may not even have been aware of.

    That being said, there can be other issues. After each test you want to return your database to a known state. At my current job we nuke the database by executing all the DROP statements and then completely recreating all the tables from scratch. This is extremely slow on Oracle, but can be very fast if you use an in-memory database like HSQLDB. When we need to flush out Oracle-specific issues we just change the database URL and driver properties and then run against Oracle (this kind of switch is sketched at the end of this answer). If you don't have this kind of database portability, Oracle also has some kind of database snapshot feature which can be used specifically for this purpose: rolling back the entire database to some previous state. I'm not sure what other databases have.

    Depending on what kind of data will be in your database, the API or the load approach may work better or worse. When you have highly structured data with many relations, APIs will make your life easier by making the relations between your data explicit. It will be harder for you to make a mistake when creating your test data set. As mentioned by other posters, refactoring tools can take care of some of the changes to the structure of your data automatically. Often I find it useful to think of API-generated test data as composing a scenario: when a user/system has done steps X, Y and Z, the tests will go from there. These states can be achieved because you can write a program that calls the same API your user would use.

    Loading data becomes much more important when you need large volumes of data, when you have few relations within your data, or when there is consistency in the data that cannot be expressed using APIs or standard relational mechanisms. At one job I worked at, my team was writing the reporting application for a large network packet inspection system. The volume of data was quite large for the time. In order to trigger a useful subset of test cases we really needed test data generated by the sniffers; that way, the information about one protocol would correlate correctly with the information about another protocol. It was difficult to capture this in the API.

    Most databases have tools to import and export delimited text files of tables. But often you only want subsets of them, which makes using data dumps more complicated. At my current job we need to take some dumps of actual data which gets generated by Matlab programs and stored in the database. We have a tool which can dump a subset of the database data and then compare it with the "ground truth" for testing. It seems our extraction tools are being constantly modified.
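
    A sketch of the database-portability idea described above: the JDBC URL and credentials come from system properties, defaulting to an in-memory HSQLDB instance, and the schema is dropped and rebuilt so every run starts from scratch. The property names and the person table are assumptions:

      import java.sql.Connection;
      import java.sql.DriverManager;
      import java.sql.Statement;

      public class TestDatabase {

          // The JDBC URL defaults to in-memory HSQLDB, but can be switched to Oracle
          // (or anything else) purely through system properties.
          public static Connection open() throws Exception {
              String url = System.getProperty("test.db.url", "jdbc:hsqldb:mem:testdb");
              String user = System.getProperty("test.db.user", "sa");
              String password = System.getProperty("test.db.password", "");
              return DriverManager.getConnection(url, user, password);
          }

          // Drop everything and rebuild the tables so every run starts from scratch.
          public static void recreateSchema(Connection connection) throws Exception {
              try (Statement stmt = connection.createStatement()) {
                  stmt.execute("DROP SCHEMA PUBLIC CASCADE");  // HSQLDB syntax; Oracle needs its own drop script
                  stmt.execute("CREATE TABLE person (id INT PRIMARY KEY, name VARCHAR(100))");
                  // ... remaining CREATE TABLE statements ...
              }
          }
      }
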

  • Why are these two approaches defined as being mutually exclusive?

    • I can't see any viable argument for not using pre-existing data sets, especially particular data that has caused problems in the past.

    • I can't see any viable argument for not programmatically extending that data with all the possible conditions that you can imagine causing problems and even a bit of random data for integration testing.

    In modern agile approaches, unit tests are where it really matters that the same tests are run each time. This is because unit tests are aimed not at finding bugs but at preserving the functionality of the app as it is developed, allowing the developer to refactor as needed.

    Integration tests, on the other hand, are designed to find the bugs you did not expect. Running with some different data each time can even be good, in my opinion. You just have to make sure your test preserves the failing data if you get a failure (one way of doing this is sketched after this answer). Remember, in formal integration testing, the application itself will be frozen except for bug fixes, so your tests can be changed to test for the maximum possible number and kinds of bugs. In integration, you can and should throw the kitchen sink at the app.

    As others have noted, of course, all this naturally depends on the kind of application that you are developing and the kind of organization you are in, etc.
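
    One way to get different data on each run while still preserving the failing case is to seed the random generator and report the seed on failure, so the exact same data can be regenerated later. A sketch, where PriceCalculator is a hypothetical class under test:

      import java.util.Random;

      import org.junit.Test;
      import static org.junit.Assert.assertTrue;

      public class RandomisedPricingIT {

          @Test
          public void totalIsNeverNegative() {
              // Pick a fresh seed per run, but include it in the failure message so the
              // exact same "random" data can be rebuilt when investigating the bug.
              long seed = System.currentTimeMillis();
              Random random = new Random(seed);

              for (int i = 0; i < 100; i++) {
                  int quantity = random.nextInt(1000);
                  double unitPrice = random.nextDouble() * 500;

                  // PriceCalculator is a hypothetical class under test.
                  double total = new PriceCalculator().total(quantity, unitPrice);

                  assertTrue("negative total for seed=" + seed, total >= 0.0);
              }
          }
      }
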

  • In integration tests, you need to test with a real database, as you have to verify that your application can actually talk to the database. Isolating the database as a dependency means that you are postponing the real test of whether your database was deployed properly, your schema is as expected and your app is configured with the right connection string. You don't want to find any problems with these when you deploy to production.

    You also want to test with both precreated data sets and an empty data set. You need to test both the path where your app starts with an empty database (containing only your default initial data) and begins creating and populating data, and the path where it starts with well-defined data sets that target specific conditions you want to test, like stress, performance and so on.

    Also, make sure that you have the database in a well-known state before each test. You don't want to have dependencies between your integration tests.
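
    A sketch of covering both start-up paths with a known database state before each test; SchemaInstaller, TestDataLoader and Application are hypothetical helpers standing in for whatever the real application and test infrastructure provide:

      import java.sql.Connection;
      import java.sql.DriverManager;

      import org.junit.Before;
      import org.junit.Test;

      public class StartupPathsIT {

          private Connection connection;

          @Before
          public void resetToKnownState() throws Exception {
              // Hypothetical in-memory test database, rebuilt before every test so that
              // no test depends on what an earlier test left behind.
              connection = DriverManager.getConnection("jdbc:hsqldb:mem:testdb", "sa", "");
              SchemaInstaller.install(connection);      // hypothetical helper that runs the install script
          }

          @Test
          public void firstRunPopulatesDefaultData() throws Exception {
              // Path 1: empty database with only the default initial data; the app
              // creates and populates its own data.
              Application.start(connection);            // Application is a hypothetical entry point
              // ... assert that the expected default rows now exist ...
          }

          @Test
          public void reportsWorkAgainstPreparedDataSet() throws Exception {
              // Path 2: a well-defined data set targeting a specific condition.
              TestDataLoader.load(connection, "reporting-fixture.sql");   // hypothetical loader
              Application.start(connection);
              // ... assert the report results against the known data set ...
          }
      }
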