Finally, there is the potential for significant “red tape” at the researcher’s home institution,
which, in general, must be navigated before any work begins. For instance, the analysis of
corporate data often requires signatures from university officials on nondisclosure or legal
agreements. Because these require the attention of different offices in other parts of the
university or of higher-level officers at the university, obtaining these approvals can be time
consuming and may require substantial lead time, normally counted in months. Moreover, the
legal expectations of the data provider may differ substantially from what the university legal
team is willing to accept, which can add time and risk to this process. For instance, data
providers may have more stringent expectations in terms of who owns work product, whether
they will have final review, how the data will be stored, and so on. The legal designation of the
researcher (whether she is a contractor, employee, or something else) can also be a point of
contention.
In addition, researchers generally require Institutional Review Board (IRB) approval at their
home institutions, which generates additional overhead, especially in the context of experiments
where subjects are “manipulated” online to understand a causal effect.
Challenges in Data Processing and Analysis
Although the data sources described above are disparate in their nature, most companies have
their databases designed in ways that are broadly similar: tables of workers, employers, job
openings, hours worked, contributions, and so on. Much of this data is stored in traditional
off-the-shelf databases, and accessing the data via SQL is straightforward. In this respect, these
data sources are not "big" in the sense that special skills are required to extract or process the
data.
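As a rough illustration of how ordinary this kind of access can be, the sketch below joins a table of workers to a table of hours worked and totals the hours per worker. The table names, column names, and sample rows are hypothetical and chosen only for illustration; an in-memory SQLite database stands in for whatever database a given company actually uses.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database with hypothetical tables and rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE workers (worker_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE hours_worked (
            worker_id INTEGER REFERENCES workers(worker_id),
            job_id INTEGER,
            hours REAL
        );
        INSERT INTO workers VALUES (1, 'A'), (2, 'B');
        INSERT INTO hours_worked VALUES (1, 10, 5.0), (1, 11, 2.5), (2, 10, 4.0);
        """
    )
    return conn


def total_hours_by_worker(conn):
    """Return (name, total_hours) rows, largest total first."""
    query = """
        SELECT w.name, SUM(h.hours) AS total_hours
        FROM workers AS w
        JOIN hours_worked AS h ON h.worker_id = w.worker_id
        GROUP BY w.worker_id, w.name
        ORDER BY total_hours DESC
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for name, total in total_hours_by_worker(build_demo_db()):
        print(name, total)
```

In practice the same query would be run against the provider's production or replica database through whatever client the provider supports; only the connection step would change.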
However, most of the companies mentioned above also collect much more fine-grained data
(such as user clickstream data, all actions that are taken within a mobile app, or GPS
coordinates for every second a user is active on the platform). This ancillary data is often less
structured, is often not in an analytically tractable format, and may require special skills to
access. Very often, it is a database with an identifier, a timestamp, and data
stored in a lightweight interchange format (e.g., JSON). These data can be more difficult to work
with and are often separated from the data that have direct business applications.
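To make the identifier, timestamp, and JSON layout concrete, the following minimal sketch flattens such records into tabular rows and sets aside payloads that cannot be parsed. The field names (`user_id`, `timestamp`) and the three-element row shape are assumptions made for illustration, not a description of any particular company's schema.

```python
import json


def flatten_events(raw_rows):
    """Flatten (user_id, timestamp, json_payload) rows into dictionaries.

    Returns (flat, bad): `flat` holds one dict per parseable event, and `bad`
    holds the (user_id, timestamp) of rows whose payload is not a JSON object.
    """
    flat, bad = [], []
    for user_id, timestamp, payload in raw_rows:
        try:
            data = json.loads(payload)
        except (TypeError, ValueError):
            # Missing or malformed payloads are recorded rather than dropped.
            bad.append((user_id, timestamp))
            continue
        if not isinstance(data, dict):
            bad.append((user_id, timestamp))
            continue
        row = dict(data)
        # The identifier columns take precedence over same-named payload keys.
        row["user_id"] = user_id
        row["timestamp"] = timestamp
        flat.append(row)
    return flat, bad


if __name__ == "__main__":
    sample = [
        (1, "2020-01-01T00:00:00", '{"action": "click", "x": 3}'),
        (2, "2020-01-01T00:00:01", "not json"),
    ]
    print(flatten_events(sample))
```

Keeping the unparseable rows in a separate list, rather than discarding them, gives a first indication of how noisy the ancillary data is.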
Due to the separation of business data across these formats, as well as the size and scope of
the data sets, significant time is required for researchers to perform “data forensics”.
Researchers must invest time in developing an understanding of where the data is high quality
(i.e., where fields are well populated, data are normalized, etc.) and where the data are noisy or
missing. Often, it is difficult to achieve an understanding of the data generating processes
without access to the engineers responsible for designing the systems through which the data
were collected. These factors can influence the types of questions that can be successfully
answered using the data. Understanding the table structure, the primary keys, the foreign keys,
and so on can take months before research projects can even begin.
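Some of the first checks in this kind of "data forensics" can be sketched in a few lines: how well populated each field is, whether a candidate primary key is actually unique, and whether a foreign key points at rows that exist. The helper names below are hypothetical, SQLite is used only as a stand-in database, and table and column names are interpolated directly into the SQL, so they are assumed to come from a trusted source.

```python
import sqlite3


def fill_rates(conn, table):
    """Share of non-NULL values (0.0 to 1.0) for each column of `table`."""
    cols = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
    total = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    rates = {}
    for col in cols:
        # COUNT(column) counts only the non-NULL values in that column.
        filled = conn.execute(f'SELECT COUNT("{col}") FROM "{table}"').fetchone()[0]
        rates[col] = filled / total if total else 0.0
    return rates


def duplicate_keys(conn, table, key):
    """Values of `key` that occur more than once and so cannot be a primary key."""
    query = (
        f'SELECT "{key}" FROM "{table}" '
        f'GROUP BY "{key}" HAVING COUNT(*) > 1 ORDER BY "{key}"'
    )
    return [row[0] for row in conn.execute(query)]


def orphan_count(conn, child, fk, parent, pk):
    """Number of `child` rows whose non-NULL `fk` matches no `pk` in `parent`."""
    query = (
        f'SELECT COUNT(*) FROM "{child}" AS c '
        f'LEFT JOIN "{parent}" AS p ON c."{fk}" = p."{pk}" '
        f'WHERE c."{fk}" IS NOT NULL AND p."{pk}" IS NULL'
    )
    return conn.execute(query).fetchone()[0]
```

Checks like these do not replace a conversation with the engineers who built the system, but they can help narrow down which tables and fields are worth asking about.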